Hypothesis Testing

A note before we begin:

Hypothesis testing is a rather involved mathematical process. It requires familiarity with various distribution functions, especially the normal distribution, and with ways to compute probabilities from it using Cumulative Distribution Functions (CDFs) and Probability Density Functions (PDFs). This notebook will move quickly to give a point of reference for how this kind of thing can be done "under the hood", but in all likelihood you will be using a library like scipy.stats to do this quickly and easily. So just hang on for the ride.

Acknowledgements:

A lot of the statistical functions here were pulled directly from the book Data Science from Scratch. It's a fun book that's really well suited to this site, though not required reading to become a good data scientist. Nonetheless, credit is given where due.

Introduction

Perhaps one of the most useful concepts in statistics, and certainly one in which you should be very fluent, is hypothesis testing. A lot of what scientists parade as fact is better described as "statistically significant." In physics, gravity will pull something down 100% of the time. In the real world, however, there is significant randomness (which is a topic for another notebook).

Statistical inference, the process of inferring a distribution or one of its features from sample observations, is a big and important field. Today, we're just going to talk about testing hypotheses to make sure our findings are statistically significant.

Here's how it works:

The Null Hypothesis

You've heard this before. The basic process of significance testing is that we put forth a null hypothesis and ask the data to provide enough evidence for us to reject it. The term null signifies that the difference between the quantities we are testing is null (i.e., zero).

If the data does not provide sufficient evidence (or what we deem sufficient, typically judged with a score called a $p$-value), then we retain the null hypothesis. Note: we do not accept the null hypothesis, we just fail to reject it. Same same, but different.

The two competing hypotheses are called the null and alternative hypotheses. In statistics, they are denoted $H_0$ and $H_A$ respectively. Here's a simple example with coin flips:

  • $H_0$: The coin is fair (heads comes up with probability $p=\frac{1}{2}$)
  • $H_A$: The coin is not fair ($p\ne\frac{1}{2}$)

So let's try this out by flipping a coin a bunch of times:


In [1]:
import random

random.seed(42)

def flip(n):
    results = {'heads': 0, 'tails': 0}
    for _ in range(n):
        coin = random.choice(['heads', 'tails'])
        results[coin] += 1
    return results

def print_proportions(flips):
    total = flips['heads'] + flips['tails']
    print("Heads p={}\nTails p={}".format(
        flips['heads']/total, flips['tails']/total))

print_proportions(flip(1000))


Heads p=0.486
Tails p=0.514

Even though we flipped the coin 1,000 times, we still didn't get perfect proportions of 0.5 each. So we test our null hypothesis by asking: "Let us assume that the probabilities of heads and tails are both 0.5. What are the odds that we then get the results we did? And is that so unlikely that we have to reject the hypothesis that heads and tails are equally likely?"
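
Before we do any math, one way to get a feel for that question is brute force: simulate a bunch of 1,000-flip experiments with a genuinely fair coin and count how often the heads total drifts at least as far from 500 as our observed 486 did. This is just a rough sketch reusing the flip() function from above; the experiment count and seed are arbitrary choices.

In [ ]:
random.seed(0)  # arbitrary seed, just for reproducibility

observed_heads = 486       # from the experiment above
num_experiments = 2000     # arbitrary; more experiments give a steadier estimate

at_least_as_extreme = 0
for _ in range(num_experiments):
    heads = flip(1000)['heads']   # one experiment under H_0: the coin is fair
    if abs(heads - 500) >= abs(observed_heads - 500):
        at_least_as_extreme += 1

# Fraction of fair-coin experiments at least as lopsided as ours;
# this should come out somewhere around 0.4 -- not unusual at all.
print(at_least_as_extreme / num_experiments)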

The normal distribution

Remember our brief discussion of the normal distribution in descriptive statistics? We haven't gone deep into distributions yet, but here are some spoilers that you can feel free to take at face value:

  • The number of heads in an experiment of $n$ flips of a coin with heads probability $p$ will be approximately normally distributed with mean $\mu = pn$ and standard deviation $\sigma = \sqrt{p(1-p)n}$

In [12]:
import math
def normal_approx_to_binomial(n, p):
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

normal_approx_to_binomial(1000, 0.5)


Out[12]:
(500.0, 15.811388300841896)
  • To find the probability that the number of heads is at or below a certain value, we use a Cumulative Distribution Function, or CDF. It simply says: given a $\mu$, a $\sigma$, and a value $X$, what is $P(x \le X)$ for an observation $x$?
  • We can use Python's math.erf() function to create a normal CDF (just hang on for the ride here; this is not the detailed subject of this notebook):

In [13]:
def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

Let's play around with this for a bit so you get a sense of it. Let's use our coin flip example, where $\mu=500$ and $\sigma\approx 15.81$, and see the probability that the number of heads is at or below various values.


In [28]:
mu, sigma = normal_approx_to_binomial(1000, 0.5)

for x in [450, 475, 490, 500, 510, 525, 550]:
    print("P(x <= {}) = {:0.5f}".format(x, normal_cdf(x, mu, sigma)))


P(x <= 450) = 0.00078
P(x <= 475) = 0.05692
P(x <= 490) = 0.26354
P(x <= 500) = 0.50000
P(x <= 510) = 0.73646
P(x <= 525) = 0.94308
P(x <= 550) = 0.99922

Look at the output above; what does it tell you? The probability that our observed value, $x$, is less than or equal to our test value, $X$. So in the case of 450, there is a 0.08% chance that we get 450 or fewer heads after 1,000 flips. That's pretty low. So if we ran the trial and got 450 heads, we would be very confident that there's something strange going on with that coin.

Those are the fundamentals of hypothesis testing. Now let's talk a bit more about that percentage and what it means to statisticians.

The $p$-value

The numbers we calculated in the previous cell are referred to as probability values (or $p$-values) by statisticians. They are crucially important. Most scientific studies that tout a finding as statistically significant are technically saying they got a low enough $p$-value to reject the null hypothesis.
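
As a concrete (and deliberately minimal) sketch, here is one way to turn the CDF above into a two-sided $p$-value for the 486 heads we observed earlier, reusing mu, sigma, and normal_cdf from the cells above. The helper name two_sided_p_value is just for illustration.

In [ ]:
def two_sided_p_value(x, mu=0, sigma=1):
    # Probability of a result at least as extreme as x, in either direction,
    # under a Normal(mu, sigma) model of the null hypothesis.
    if x >= mu:
        return 2 * (1 - normal_cdf(x, mu, sigma))
    return 2 * normal_cdf(x, mu, sigma)

# 486 heads out of 1,000 flips: the p-value comes out around 0.38,
# nowhere near low enough to reject the fair-coin hypothesis.
two_sided_p_value(486, mu, sigma)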

Interpretation

Stop for a second and think about what that means in technical language: the probability that our sample statistic would arise by chance under the null hypothesis is lower than some given threshold, thus we assert it was not sourced from the model distribution (we reject the null hypothesis).

And just as important is the reverse: the probability that our sample statistic would arise by chance is higher than the given threshold, thus we retain the null hypothesis.

This language, thus we retain the null hypothesis, is very important. We are not saying that the two are the same, just that we don't have enough evidence to say they are different. In fact, very few things in statistics are exactly the same unless they come from near-perfect random processes like dice and coins and playing cards. When we talk about people, there's a lot more blur around the edges.
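
In code, the decision itself boils down to comparing the $p$-value against a threshold chosen before the experiment (0.05 is a common, if arbitrary, convention). The names below are purely illustrative:

In [ ]:
ALPHA = 0.05  # significance level, chosen before looking at the data

def decide(p_value, alpha=ALPHA):
    # We either reject H_0 or fail to reject it -- we never "accept" it.
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.38))    # our coin-flip result above: fail to reject
print(decide(0.0008))  # something as rare as 450 heads: reject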

Making mistakes

Two very important problems now arise in hypothesis testing: getting the distribution of the test statistic under the null hypothesis correct, and picking a reasonable threshold for the $p$-value (the significance level).

If these are wrong, there are two types of errors the experimenters will be more prone to (see the simulation sketch after this list):

  • Type I: Rejecting the Null Hypothesis when it is in fact true
  • Type II: Retaining the Null Hypothesis when it is in fact false
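
To make Type I errors concrete, here is a rough simulation sketch, reusing the helpers defined above: run many experiments with a genuinely fair coin (so the null hypothesis is true every time) and count how often a test at $\alpha=0.05$ rejects it anyway. By construction, that should happen somewhere near 5% of the time.

In [ ]:
random.seed(1)  # arbitrary seed for reproducibility

def estimate_type_1_error_rate(num_experiments=1000, n=1000, alpha=0.05):
    # The null hypothesis is true in every simulated experiment (the coin
    # really is fair), so every rejection counted here is a Type I error.
    mu, sigma = normal_approx_to_binomial(n, 0.5)
    false_rejections = 0
    for _ in range(num_experiments):
        heads = flip(n)['heads']
        if two_sided_p_value(heads, mu, sigma) < alpha:
            false_rejections += 1
    return false_rejections / num_experiments

# Should land somewhere in the neighborhood of 0.05 (sampling noise aside).
estimate_type_1_error_rate()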
